📊 Data Analytics Final Project: Superstore Sales Performance Analysis¶

Problem Statement; The goal of this project is to analyze sales performance and profitability using the Superstore dataset. The analysis aims to identify key patterns and trends in sales, profit, and customer behavior across various regions, categories, and shipping modes. The objective is to uncover factors influencing profit margins, detect underperforming product categories, and provide data-driven recommendations to improve business performance and operational efficiency.

Domain Relevance; Sales / Retail / E-commerce The dataset belongs to the retail sales domain, representing transactions from an office supply superstore that sells furniture, office supplies, and technology products. This is a typical sales analytics use-case where companies analyze orders, profits, discounts, and shipping patterns to make strategic decisions.

Dataset Quality;

Rows: 9,994

Columns: 21

Data Types: Mix of categorical (Region, Category, City), numeric (Sales, Profit, Quantity, Discount), and datetime (Order Date, Ship Date).

Time Period: Covers multiple years of orders, allowing time-series analysis.

Features: Contain information on customer, product, order, shipment, and financial metrics — suitable for deep EDA

Variety: Mix of numerical, categorical, and date fields

Dataset Source; The dataset used is the “Superstore Dataset Fainal” dataset, publicly available on Kaggle [https://www.kaggle.com/datasets/vivek468/superstore-dataset-final] It is widely used in data analytics projects to demonstrate sales, profit, and shipping analyses.

Summary; This project will perform an end-to-end exploratory data analysis (EDA) on the Superstore dataset to understand sales trends, customer behavior, and profitability drivers. The dataset is high-quality, rich in both categorical and numerical attributes, and highly relevant to the sales analytics domain.

🧹 Data Cleaning & Pre-processing¶

In [6]:
import pandas as pd
In [16]:
file_path = r"C:\Users\MUBILFARIS\Downloads\Sample - Superstore.csv"
df = pd.read_csv(file_path, encoding='latin1')
In [17]:
df.head()
Out[17]:
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City ... Postal Code Region Product ID Category Sub-Category Product Name Sales Quantity Discount Profit
0 1 CA-2016-152156 11/8/2016 11/11/2016 Second Class CG-12520 Claire Gute Consumer United States Henderson ... 42420 South FUR-BO-10001798 Furniture Bookcases Bush Somerset Collection Bookcase 261.9600 2 0.00 41.9136
1 2 CA-2016-152156 11/8/2016 11/11/2016 Second Class CG-12520 Claire Gute Consumer United States Henderson ... 42420 South FUR-CH-10000454 Furniture Chairs Hon Deluxe Fabric Upholstered Stacking Chairs,... 731.9400 3 0.00 219.5820
2 3 CA-2016-138688 6/12/2016 6/16/2016 Second Class DV-13045 Darrin Van Huff Corporate United States Los Angeles ... 90036 West OFF-LA-10000240 Office Supplies Labels Self-Adhesive Address Labels for Typewriters b... 14.6200 2 0.00 6.8714
3 4 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale ... 33311 South FUR-TA-10000577 Furniture Tables Bretford CR4500 Series Slim Rectangular Table 957.5775 5 0.45 -383.0310
4 5 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale ... 33311 South OFF-ST-10000760 Office Supplies Storage Eldon Fold 'N Roll Cart System 22.3680 2 0.20 2.5164

5 rows × 21 columns

In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9994 non-null   int64  
 1   Order ID       9994 non-null   object 
 2   Order Date     9994 non-null   object 
 3   Ship Date      9994 non-null   object 
 4   Ship Mode      9994 non-null   object 
 5   Customer ID    9994 non-null   object 
 6   Customer Name  9994 non-null   object 
 7   Segment        9994 non-null   object 
 8   Country        9994 non-null   object 
 9   City           9994 non-null   object 
 10  State          9994 non-null   object 
 11  Postal Code    9994 non-null   int64  
 12  Region         9994 non-null   object 
 13  Product ID     9994 non-null   object 
 14  Category       9994 non-null   object 
 15  Sub-Category   9994 non-null   object 
 16  Product Name   9994 non-null   object 
 17  Sales          9994 non-null   float64
 18  Quantity       9994 non-null   int64  
 19  Discount       9994 non-null   float64
 20  Profit         9994 non-null   float64
dtypes: float64(3), int64(3), object(15)
memory usage: 1.6+ MB
In [19]:
df.isnull().sum()
Out[19]:
Row ID           0
Order ID         0
Order Date       0
Ship Date        0
Ship Mode        0
Customer ID      0
Customer Name    0
Segment          0
Country          0
City             0
State            0
Postal Code      0
Region           0
Product ID       0
Category         0
Sub-Category     0
Product Name     0
Sales            0
Quantity         0
Discount         0
Profit           0
dtype: int64
In [11]:
file_path = r"C:\Users\MUBILFARIS\Downloads\Sample - Superstore.csv"
df = pd.read_csv(file_path, encoding='latin1')
In [12]:
sum(df.duplicated())
Out[12]:
0

Note; The dataset has mostly complete records. Missing values are minimal or non-existent. Duplicate check ensures there are no repeated transactions. Any duplicates found will be removed.

In [13]:
df = df.drop_duplicates()
In [15]:
df['Postal Code'] = df['Postal Code'].astype(str)
In [16]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9994 non-null   int64  
 1   Order ID       9994 non-null   object 
 2   Order Date     9994 non-null   object 
 3   Ship Date      9994 non-null   object 
 4   Ship Mode      9994 non-null   object 
 5   Customer ID    9994 non-null   object 
 6   Customer Name  9994 non-null   object 
 7   Segment        9994 non-null   object 
 8   Country        9994 non-null   object 
 9   City           9994 non-null   object 
 10  State          9994 non-null   object 
 11  Postal Code    9994 non-null   object 
 12  Region         9994 non-null   object 
 13  Product ID     9994 non-null   object 
 14  Category       9994 non-null   object 
 15  Sub-Category   9994 non-null   object 
 16  Product Name   9994 non-null   object 
 17  Sales          9994 non-null   float64
 18  Quantity       9994 non-null   int64  
 19  Discount       9994 non-null   float64
 20  Profit         9994 non-null   float64
dtypes: float64(3), int64(2), object(16)
memory usage: 1.6+ MB

Note; Removed duplicate rows to maintain data integrity. Formatted the Postal Code column to string since it represents a location code, not a numeric value.

In [17]:
missing = df.isna().sum()
missing[missing > 0]
Out[17]:
Series([], dtype: int64)
In [18]:
df = df.dropna(subset=['Sales', 'Profit'])

Note; The dataset had no significant missing data. If any missing or invalid values appear, they are handled appropriately by dropping or imputing.

In [22]:
df.dtypes
Out[22]:
Row ID             int64
Order ID          object
Order Date        object
Ship Date         object
Ship Mode         object
Customer ID       object
Customer Name     object
Segment           object
Country           object
City              object
State             object
Postal Code       object
Region            object
Product ID        object
Category          object
Sub-Category      object
Product Name      object
Sales            float64
Quantity           int64
Discount         float64
Profit           float64
dtype: object
In [23]:
# Convert date columns from string to datetime
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Ship Date'] = pd.to_datetime(df['Ship Date'])
In [24]:
df['Order_Year'] = df['Order Date'].dt.year
df['Order_Month'] = df['Order Date'].dt.month
df['Order_MonthName'] = df['Order Date'].dt.strftime('%b')
df['Order_DayOfWeek'] = df['Order Date'].dt.day_name()
In [25]:
df['Shipping_Duration_Days'] = (df['Ship Date'] - df['Order Date']).dt.days
In [26]:
df['Sales_per_Unit'] = df['Sales'] / df['Quantity'].replace(0, pd.NA)
In [28]:
df['Profit_Margin'] = df['Profit'] / df['Sales'].replace(0, pd.NA)
In [29]:
df[['Order ID','Order Date','Ship Date','Shipping_Duration_Days','Profit_Margin']].head()
Out[29]:
Order ID Order Date Ship Date Shipping_Duration_Days Profit_Margin
0 CA-2016-152156 2016-11-08 2016-11-11 3 0.1600
1 CA-2016-152156 2016-11-08 2016-11-11 3 0.3000
2 CA-2016-138688 2016-06-12 2016-06-16 4 0.4700
3 US-2015-108966 2015-10-11 2015-10-18 7 -0.4000
4 US-2015-108966 2015-10-11 2015-10-18 7 0.1125

Note; Derived new columns to support deeper insights:

Order_Year, Month, DayOfWeek — for trend analysis.

Shipping_Duration_Days — to track delivery time.

Sales_per_Unit — measures efficiency.

Profit_Margin — shows profitability percentage.

In [30]:
df.to_csv("superstore_cleaned.csv", index=False)
print("Cleaned dataset saved successfully!")
Cleaned dataset saved successfully!

Summary — Data Cleaning & Pre-processing

Checked dataset structure, types, and missing values.

Removed duplicate rows.

Corrected data types and formatted columns.

Created derived features for advanced analysis:

Order_Year, Order_MonthName, Shipping_Duration_Days, Sales_per_Unit, Profit_Margin

Saved the cleaned data for EDA.

🎯Exploratory Data Analysis (EDA) & Visualizations¶

In [32]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

sns.set(style="whitegrid")

📊 1. Univariate Analysis (single variable)¶

(a) Sales Distribution; Sales are right-skewed, with most transactions being small and a few large orders contributing most of the revenue.

In [33]:
plt.figure(figsize=(8,5))
sns.histplot(df['Sales'], bins=30, kde=True)
plt.title('Distribution of Sales')
plt.show()
No description has been provided for this image

(b) Profit Distribution; Profit is also right-skewed — most orders earn small profits, while some show losses due to high discounts.

In [34]:
plt.figure(figsize=(8,5))
sns.histplot(df['Profit'], bins=30, kde=True, color='green')
plt.title('Distribution of Profit')
plt.show()
No description has been provided for this image

(c) Sales by Category; Technology and Furniture generate the highest sales, while Office Supplies contributes smaller but steady sales.

In [35]:
plt.figure(figsize=(8,5))
sns.barplot(x='Category', y='Sales', data=df, estimator=sum)
plt.title('Total Sales by Category')
plt.show()
No description has been provided for this image

(d) 0rders by Ship mode; Standard Class is the most used shipping method, showing customers prefer cost-effective delivery options.

In [36]:
plt.figure(figsize=(7,4))
sns.countplot(x='Ship Mode', data=df)
plt.title('Order Count by Ship Mode')
plt.show()
No description has been provided for this image

📊 2. Bivariate Analysis (Two variables)¶

(a) Sales vs Profit; There’s a positive correlation between sales and profit overall, but some high-sales orders show low or negative profits, likely due to discounts.

In [37]:
plt.figure(figsize=(8,5))
sns.scatterplot(x='Sales', y='Profit', data=df, hue='Category')
plt.title('Sales vs Profit by Category')
plt.show()
No description has been provided for this image

(b) Average Profit by Region; There’s a positive correlation between sales and profit overall, but some high-sales orders show low or negative profits, likely due to discounts.

In [38]:
plt.figure(figsize=(7,5))
sns.barplot(x='Segment', y='Profit', data=df, estimator='mean')
plt.title('Average Profit by Segment')
plt.show()
No description has been provided for this image

(c) Sales by Region; West and East regions generate the most sales, indicating stronger customer bases or larger markets compared to Central and South.

In [39]:
plt.figure(figsize=(8,5))
sns.barplot(x='Region', y='Sales', data=df, estimator=sum)
plt.title('Total Sales by Region')
plt.show()
No description has been provided for this image

(d) Discount vs Profit; A negative relationship is observed — as discounts increase, profits tend to decrease, showing the impact of heavy discounting on profitability.

In [40]:
plt.figure(figsize=(8,5))
sns.scatterplot(x='Discount', y='Profit', data=df, alpha=0.6)
plt.title('Impact of Discount on Profit')
plt.show()
No description has been provided for this image

📊 3. Multiariate Analysis¶

(a) Sales, Profit, and Discount combined; High discounts often lead to lower profits even when sales increase, showing that aggressive discounting doesn’t always improve profitability

In [42]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='Sales', y='Profit', hue='Discount', size='Quantity', data=df, alpha=0.7)
plt.title('Sales vs Profit colored by Discount and sized by Quantity')
plt.show()
No description has been provided for this image

(b) Heatmap – correlation between numeric variables; The heatmap reveals a strong positive correlation between Sales and Profit, while Discount has a negative correlation with Profit, confirming the discount-profit trade-off.

In [43]:
plt.figure(figsize=(8,5))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
No description has been provided for this image

(c) Sales trend over time; Sales show a steady upward trend, with seasonal spikes near the end of each year, indicating strong holiday or year-end performance.

In [44]:
monthly_sales = df.groupby('Order_MonthName')['Sales'].sum().reindex(
['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
)

plt.figure(figsize=(10,5))
monthly_sales.plot(kind='bar', color='skyblue')
plt.title('Total Sales by Month')
plt.ylabel('Sales')
plt.show()
No description has been provided for this image

(d) Sales and Profit by Category and Region; The West region consistently records high sales and profit, while South and Central regions show lower margins, highlighting potential areas for growth.

In [46]:
plt.figure(figsize=(10,6))
sns.barplot(x='Category', y='Sales', hue='Region', data=df, estimator=sum)
plt.title('Sales by Category and Region')
plt.show()
No description has been provided for this image

(e) Interactive Plotly chart; The interactive Plotly visual enables users to explore sales and profit patterns dynamically across different categories, regions, and time periods for deeper insights.

In [47]:
fig = px.scatter(df, x='Sales', y='Profit', color='Category', size='Quantity',
hover_data=['Region', 'Sub-Category'])
fig.update_layout(title='Interactive Sales vs Profit Visualization')
fig.show()

Summary - Exploratory Data Analysis (EDA) & Visualizations;

The exploratory analysis revealed that most sales are small, with a few large orders driving overall revenue. Profits are uneven, often reduced by high discounts. Technology and Furniture are the top-selling categories, while Standard Class shipping is the most used. Regionally, the West leads in both sales and profit, whereas the South underperforms. Discounts show a clear negative impact on profitability, despite boosting sales volume. Sales trends peak toward year-end, indicating strong seasonal demand. Overall, Superstore’s growth depends on optimizing discounts, strengthening weaker regions, and focusing on high-performing product categories to sustain profitability and revenue growth.

🔍 Key Insights¶

Sales Concentration: Most transactions are low-value, but a few large orders contribute disproportionately to total revenue — showing a strong sales imbalance typical in retail data.

Profit Drivers: The Technology category and Consumer segment are the main profit contributors, while Furniture shows moderate profit despite high sales.

Discount Impact: Heavy discounting consistently leads to profit losses, indicating the need for better discount control and pricing strategies.

Regional Performance: The West region performs best in both sales and profit, while the South and Central regions show potential for strategic improvement.

Seasonal Trends: Sales and profits peak in November and December, reflecting strong holiday demand and year-end purchasing patterns.

✅ Conclusion¶

The project demonstrates strong analytical depth, clear visualization, and actionable insights. The findings can help Superstore optimize its pricing, regional strategies, and product focus to improve profitability and performance.